fix(#1442): scan_*_references reads raw JSON metadata instead of decoded codec output (#1444)
`gc.scan_hash_references` and `gc.scan_schema_references` were calling
`table.to_arrays(attr_name)`, which routes through `Expression.to_dicts`
(`expression.py:899`) and runs `decode_attribute` (`codecs.py:518`) for
each value. For the storage codecs (`<blob@>`, `<hash@>`, `<attach@>`,
`<npy@>`, `<object@>`) this means downloading the bytes from external
storage and deserializing them into the codec's runtime type — a
`numpy.ndarray`, an `NpyRef`, an `ObjectRef`, raw bytes, or a local
file-path string — none of which satisfy `_extract_*_refs`'s
`isinstance(value, dict) and "path" in value` check.
Both helpers therefore silently returned empty reference sets. Every
populated schema reported `hash_referenced: 0` and
`schema_paths_referenced: 0`, so every external file looked orphaned to
`scan()` and a subsequent `collect()` would have deleted live data. The
broad `try/except Exception` around the loop never fired because no
exception was raised — `_extract_*_refs` returns `[]` for unrecognized
shapes by design.
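A minimal sketch of the mismatch (the helper below is illustrative, not the real `_extract_*_refs`): the decoded payload is never a dict, so the check falls through to the empty-list default and no exception is raised.

```python
# Illustrative sketch of the shape check, not DataJoint's actual code:
# only a dict carrying a "path" key counts as a reference.
def extract_refs(value):
    if isinstance(value, dict) and "path" in value:
        return [value["path"]]
    return []  # unrecognized shapes yield nothing; no exception is raised

# What decode_attribute hands back for an <attach@> column: raw bytes.
decoded = b"payload bytes pulled from external storage"
# What the raw JSON metadata column actually stores (illustrative path):
stored = {"path": "blobs/ab/cd/abcd1234", "store": "local"}

print(extract_refs(decoded))  # -> []: the live reference is silently lost
print(extract_refs(stored))   # -> ['blobs/ab/cd/abcd1234']
```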
The intended design (per `reference/specs/type-system.md`) is for GC to
operate on the raw stored JSON metadata, not the decoded payload.
Replace `table.to_arrays(attr_name)` with
`table.proj(attr_name).cursor(as_dict=True)` in both helpers. The cursor
yields the JSON column value directly: a Python dict on PostgreSQL
(JSONB auto-decoded) or a JSON string on MySQL. `_extract_*_refs`
already handles both shapes (`gc.py:138` string branch, `gc.py:145` dict
branch), so this is backend-agnostic with no adapter dispatch.
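The two cursor shapes can be sketched as follows (a simplified stand-in for the two branches in `gc.py`, not the exact source):

```python
import json

# Simplified stand-in for the two gc.py branches (not the exact source):
def refs_from_json_column(value):
    # MySQL branch: the JSON column arrives as a string; parse it first.
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            return []
    # PostgreSQL branch: JSONB arrives as an already-decoded dict.
    if isinstance(value, dict) and "path" in value:
        return [value["path"]]
    return []

meta = {"path": "npy/12/34/1234abcd.npy"}
assert refs_from_json_column(meta) == ["npy/12/34/1234abcd.npy"]              # PostgreSQL shape
assert refs_from_json_column(json.dumps(meta)) == ["npy/12/34/1234abcd.npy"]  # MySQL shape
```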
Side effect — `scan` is now a metadata-only operation. Previously it
downloaded every external blob just to deserialize and discard the
result via the silent type mismatch; on a 1 TB schema that meant 1 TB
of egress to produce `referenced: 0`. After this change, scan touches
only the JSON column on the database.
Custom-codec authors are unaffected: reference discovery operates on
the raw stored metadata regardless of what the codec's `decode()`
returns, so third-party codecs following the documented
`encode`/`decode` contract get correct GC for free.
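As an illustration of why that holds, consider a hypothetical third-party codec (every name in this sketch is invented and is not DataJoint API): the contract only requires `encode()` to return JSON-serializable metadata containing the storage path, while `decode()` may return any runtime type at all, because reference discovery never calls it.

```python
# Hypothetical third-party codec; all names here are invented for the
# sketch and are not part of DataJoint's API.
class WaveformCodec:
    def encode(self, value, store):
        path = store.put(value)                      # upload the payload
        return {"path": path, "codec": "waveform"}   # JSON metadata GC will see

    def decode(self, metadata, store):
        # May return any runtime type at all; GC never calls this.
        return memoryview(store.get(metadata["path"]))

class FakeStore:
    """In-memory stand-in for an external object store."""
    def __init__(self):
        self.files = {}
    def put(self, payload):
        path = f"waveform/{len(self.files)}.bin"
        self.files[path] = payload
        return path
    def get(self, path):
        return self.files[path]

store = FakeStore()
meta = WaveformCodec().encode(b"\x00\x01\x02", store)
# Reference discovery needs only the stored metadata dict:
assert isinstance(meta, dict) and "path" in meta
```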
Tests
-----
The existing `tests/integration/test_gc.py` mocks `scan_hash_references`,
`scan_schema_references`, and `list_*_paths` directly, so the production
code path through `to_arrays` → `decode_attribute` was never exercised
end-to-end. The mocked tests stay (they cover orchestration: composition
with `list_*_paths`, dry-run vs real, stat-key shape, format strings).
Add a `TestScanWithLiveData` class with three non-mocked end-to-end
tests, one per structurally distinct decoded-value type:
- `test_scan_finds_active_blob_reference` — `<blob@>` (decode → ndarray)
- `test_scan_finds_active_npy_reference` — `<npy@>` (decode → NpyRef)
- `test_scan_finds_active_object_reference` — `<object@>` (decode → ObjectRef)
Each declares a small manual table, inserts one row, and asserts
`scan(schema, store_name='local')` reports the expected `*_referenced`
count > 0. Verified to fail on the pre-fix code:
`{'hash_referenced': 0, 'hash_stored': 1, 'hash_orphaned': 1, ...}`.
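The assertion each live-data test makes can be sketched against the stat keys quoted above (the helper name and the exact stats dicts are illustrative, not copied from the test file):

```python
# Illustrative only: the stat keys come from the failure output quoted
# above; the helper name is invented for this sketch.
def assert_live_data_visible(stats):
    assert stats["hash_referenced"] > 0, "scan must see the inserted row"
    assert stats["hash_orphaned"] == stats["hash_stored"] - stats["hash_referenced"]

# Post-fix shape: the one inserted row is counted as referenced.
assert_live_data_visible({"hash_referenced": 1, "hash_stored": 1, "hash_orphaned": 0})

# Pre-fix shape from the PR body fails the first assertion:
buggy = {"hash_referenced": 0, "hash_stored": 1, "hash_orphaned": 1}
try:
    assert_live_data_visible(buggy)
except AssertionError:
    print("pre-fix stats rejected, as expected")
```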
Adjacent
--------
Register `gc` in `_lazy_modules` (`src/datajoint/__init__.py`). The
`gc.py` module docstring and the user docs at
`how-to/garbage-collection.md` both invoke GC as `dj.gc.scan(...)`,
which previously raised `AttributeError` because `gc` wasn't lazily
exposed at the package level. Pattern matches the existing
`"diagram": (".diagram", None)` entry.
Out of scope
------------
GC remains non-transaction-safe even after this fix — there is a TOCTOU
window between scan and delete during which a concurrent transaction
could insert a row referencing what looks like an orphan. A two-phase
retrieval/removal API (quarantine → grace window → purge) is the right
remedy and will be tracked as a separate enhancement issue.
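A hedged sketch of what such a two-phase API could look like (entirely hypothetical; none of these names exist in DataJoint):

```python
# Hypothetical two-phase GC: quarantine -> grace window -> purge.
# Every name here is invented; this is a design sketch, not an API.
import time

class TwoPhaseGC:
    def __init__(self, grace_seconds=86400):
        self.grace_seconds = grace_seconds
        self.quarantine = {}  # path -> timestamp when first seen as orphaned

    def quarantine_orphans(self, orphans, now=None):
        now = time.time() if now is None else now
        for path in orphans:
            self.quarantine.setdefault(path, now)

    def purge_expired(self, still_referenced, now=None):
        """Delete only paths that stayed orphaned for the full grace window;
        a path re-referenced by a concurrent insert is released unharmed."""
        now = time.time() if now is None else now
        purged = []
        for path, seen_at in list(self.quarantine.items()):
            if path in still_referenced:          # revived by a concurrent insert
                del self.quarantine[path]
            elif now - seen_at >= self.grace_seconds:
                del self.quarantine[path]
                purged.append(path)
        return purged
```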
Fixes datajoint#1442
dimitri-yatsenko left a comment:
Approving — the fix is correct, the test scaffold proves the regression, and the framing is right.
Fix shape. table.proj(attr).cursor(as_dict=True) bypasses decode_attribute and yields the raw JSON column value — dict on PostgreSQL/JSONB, string on MySQL. Both shapes already handled by _extract_*_refs (gc.py:138 string branch, gc.py:145 dict branch). Custom-codec authors keep the documented encode/decode contract; reference discovery now works regardless of what decode() returns. As a free byproduct, scan becomes metadata-only — no more downloading every external blob to discard the deserialized result.
Test gap closed honestly. The pre-existing tests at test_gc.py:163-166 patched the scan helpers and list_*_paths, so the production path was never exercised. TestScanWithLiveData fills that with three e2e tests against real schemas with live external storage, one per structurally distinct decoded-value type (<blob@> → ndarray, <npy@> → NpyRef, <object@> → ObjectRef). Per the PR body, the same three tests fail on the buggy version — that's the right way to prove a regression test.
Lazy module registration of gc in __init__.py is a clean addition matching the existing diagram pattern and unblocks dj.gc.scan(...) as the docs already invoke it.
Out-of-scope TOCTOU is honestly flagged. The two-phase quarantine→grace-window→purge enhancement is the right next step; please file the follow-up issue once this lands.
Two minor notes:
- `<attach@>` (decodes to a local-path `str`) and `<hash@>` aren't in the e2e suite. Not blocking — the cursor approach is decoded-type-agnostic, so the three tests already prove the mechanism. Worth one line in the PR body acknowledging the scope choice.
- Strip the 🤖 trailer and any `Co-Authored-By` line at squash-merge time, per project convention.
This is a real data-loss recovery for users with <blob@>/<npy@>/<object@> columns and GC active. Land it.
Summary
Fixes #1442.
`gc.scan_hash_references` and `gc.scan_schema_references` were calling `table.to_arrays(attr_name)`, which routes through `Expression.to_dicts` → `decode_attribute` → the codec's `decode()`. For the storage codecs (`<blob@>`, `<hash@>`, `<attach@>`, `<npy@>`, `<object@>`) that returns the deserialized payload — `numpy.ndarray`, `NpyRef`, `ObjectRef`, `bytes`, or a local file-path `str`. None of those satisfy `_extract_*_refs`'s `isinstance(value, dict) and "path" in value` check, so both helpers silently returned empty reference sets and `collect()` would have classified live data as orphaned.

This PR replaces the call with `table.proj(attr_name).cursor(as_dict=True)` in both helpers. The cursor yields the raw JSON column value: a Python dict on PostgreSQL (JSONB auto-decoded) or a JSON string on MySQL — both already handled by `_extract_*_refs` (`gc.py:138` string branch, `gc.py:145` dict branch). Backend-agnostic, custom-codec-safe, and turns scan into a metadata-only operation (no more downloading every external blob just to discard the deserialized result).

Also registers `gc` in `_lazy_modules` (`src/datajoint/__init__.py`) so `dj.gc.scan(...)` works as both the module docstring and the user docs at `how-to/garbage-collection.md` invoke it. Pattern matches the existing `"diagram": (".diagram", None)` entry.

Test plan
---------
- `pytest tests/integration/test_gc.py -k "not TestScanWithLiveData"` — 26 existing mocked tests pass unchanged (orchestration coverage stays intact).
- `pytest tests/integration/test_gc.py::TestScanWithLiveData -v` — 3 new non-mocked e2e tests pass with the fix:
  - `test_scan_finds_active_blob_reference` — `<blob@>` (decode → `numpy.ndarray`)
  - `test_scan_finds_active_npy_reference` — `<npy@>` (decode → `NpyRef`)
  - `test_scan_finds_active_object_reference` — `<object@>` (decode → `ObjectRef`)
- The same three tests fail when `gc.py` is reverted to the buggy version. The failure mode reproduces the issue's exact symptom — `hash_referenced: 0`, `schema_paths_referenced: 0`, every file classified as orphaned.
- `pre-commit run` (ruff, ruff-format, codespell) clean.

Test gap closed
---------------
`tests/integration/test_gc.py:163-166` patches `scan_hash_references`, `scan_schema_references`, and `list_*_paths` directly, so the production path through `to_arrays` → `decode_attribute` was never exercised end-to-end. That's why this slipped through. The mocked tests stay (orchestration: composition with `list_*_paths`, dry-run vs real, stat-key shape, format strings). The new `TestScanWithLiveData` class fills the gap with one e2e test per structurally distinct decoded-value type.

Performance side-effect (free byproduct)
----------------------------------------
Before: scan downloaded every external blob just to deserialize and discard it via the silent type mismatch — a 1 TB schema produced 1 TB of egress to report `referenced: 0`. After: scan touches only the JSON column on the database during reference enumeration.

Custom-codec authors
--------------------
Reference discovery now operates on raw stored metadata regardless of what a codec's `decode()` returns. Third-party codecs following the documented `encode`/`decode` contract get correct GC for free — no contract change required.

Out of scope
------------
GC remains non-transaction-safe even after this fix — there is a TOCTOU window between scan and delete during which a concurrent transaction could insert a row referencing what looks like an orphan. A two-phase retrieval/removal API (quarantine → grace window → purge) is the right remedy and will be filed as a separate enhancement issue right after this PR opens.